Improving KNN Arabic Text Classification with N-Grams Based Document Indexing

Authors

  • Riyad Al-Shalabi
  • Rasha Obeidat
Abstract

Text classification is the task of assigning a document to one or more predefined categories based on its contents. This paper presents the results of classifying Arabic-language documents with the KNN classifier under two document indexing schemes: N-grams (namely unigrams and bigrams), and the traditional single-term indexing method (bag of words), which assumes, incorrectly, that the terms in a text are mutually independent. The results show that N-gram indexing yields better accuracy than single-term indexing: the average accuracy is 0.7357 with N-grams and 0.6688 with single terms.
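The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes word-level n-grams, raw term-frequency weights, cosine similarity, and majority voting among the k nearest neighbours, none of which the abstract specifies. The helper names (`index_terms`, `cosine`, `knn_classify`) and the English toy documents are hypothetical and are not drawn from the paper's corpus.

```python
from collections import Counter
import math


def index_terms(text, max_n=1):
    """Index a document as word n-gram counts for n = 1..max_n.

    max_n=1 gives single-term (bag-of-words) indexing;
    max_n=2 adds bigrams on top of the unigrams.
    Assumption: whitespace tokenization, no stemming or stop-word removal.
    """
    tokens = text.split()
    terms = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms[" ".join(tokens[i:i + n])] += 1
    return terms


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3, max_n=1):
    """Label a query by majority vote among its k most similar documents."""
    query_vec = index_terms(query, max_n)
    scored = sorted(
        ((cosine(query_vec, index_terms(doc, max_n)), label)
         for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data, for illustration only (not the paper's Arabic corpus).
TOY_TRAINING = [
    ("stock market prices rise", "economy"),
    ("market prices fall today", "economy"),
    ("team wins football match", "sports"),
    ("football match ends draw", "sports"),
]

if __name__ == "__main__":
    # Same query under the two indexing schemes.
    print(knn_classify("football match today", TOY_TRAINING, k=3, max_n=1))
    print(knn_classify("football match today", TOY_TRAINING, k=3, max_n=2))
```

With `max_n=2`, a shared bigram such as "football match" contributes to the similarity in addition to its two component words, which is one way an index can capture some of the term dependence that single-term indexing ignores.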


Similar resources

Improving K-Nearest Neighbor Efficacy for Farsi Text Classification

One of the common processes in the field of text mining is text classification. Because of the complex nature of the Farsi language, with its multi-part words and compound verbs, most text classification systems are not applicable to Farsi texts. K-Nearest Neighbors (KNN) is one of the most popular methods for text classification and presents good performance in experiments on different d...


Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced during preprocessing by noise removal, binarization, and skew detection and correction. In document segmentation, an algorith...


Unordered N-gram Representation Based on Zero-suppressed BDDs for Text Mining and Classification

In this paper, we present a new method to analyze unordered n-grams by using ZBDDs (Zero-suppressed BDDs). N-grams have been used not only for text analysis but also for text indexing in some search engines. We use a variation of n-grams called unordered n-grams. Unordered n-grams abstract from the position of the characters in each n-gram, i.e., they just deal with the range of ordinary ...


Natural Language Text Classification and Filtering with Trigrams and Evolutionary Nearest Neighbour Classifiers

N-grams offer fast language-independent multi-class text categorization. Text is reduced in a single pass to n-gram vectors. These are assigned to one of several classes by a) nearest neighbour (KNN) and b) genetic algorithm operating on weights in a nearest neighbour classifier. 91% accuracy is found on binary classification on short multi-author technical English documents. This falls if more cat...




Publication date: 2008